Abstract

Unlike static documents, version-controlled documents are edited by one or more authors over a certain
period of time. Examples include large-scale computer code, papers authored by a team of scientists,
and online discussion boards. Such a collaborative revision process makes traditional document modeling
and visualization techniques inappropriate. In this paper we propose a new visualization technique
for version-controlled documents that reveals interesting authoring patterns in papers, computer code,
and Wikipedia articles. The revealed authoring patterns are useful for readers, participants in the
authoring process, and supervisors.

1 Introduction 

Version-controlled documents, unlike static documents, are usually authored by several users and updated
over a certain period of time. One instance of version-controlled documents is large software projects
developed by teams of software engineers over a period of several weeks. Another example is scientific
papers written by teams of scientists across multiple geographic locations. A third instance is online
discussion boards like Slashdot or Google Wave. In each case the authoring process is composed of a
sequence of document revisions annotated with the date, the identity of the author, and occasionally
revision comments.

The importance of such collaboratively authored documents has recently increased substantially with 
the availability of collaborative productivity tools such as Subversion/GIT, Google Docs, MS Office 2010, 
and online forums like Wikipedia and Wordpress. These tools or websites maintain a complete revision 
history which may be used to recreate the entire authoring process (rather than just the final document). 

Visualizing version-controlled documents has a number of important applications. It may assist authors
or code developers in determining the current project status and what they should work on next (what to
revise or avoid revising). It may assist managers in ensuring that the code or document development process
progresses adequately and, if not, in identifying the problem. For example, are there authoring
inefficiencies such as certain authors consistently overwriting their colleagues? It may also be used to
expose collaborative authoring patterns leading to disinformation, which is a serious problem in the
Wikipedia project.

In this paper we develop a representation, called cumulative revision map (CRM), for visualizing
version-controlled documents. Most well known document visualization techniques are appropriate only for static
documents, but they are unable to capture patterns in the authoring process. Our approach displays the entire 
revision history as a two dimensional diagram where rows correspond to revisions and columns to document 
position. The spatial arrangement of additions and deletions reveals informative authoring patterns. After 
explaining the CRM visualization method we demonstrate it on a combination of synthetic documents and 
real world documents including Wikipedia articles, scientific papers, and computer code. We also compare 
the CRM visualization technique to related work such as History Flow and discuss their pros and cons. 



2 Related Work 

Several attempts have been made to visualize themes and topics in documents, either by keeping track of 
the word distribution or by dimensionality reduction techniques e.g.. lfl0l[T4ll2Tll24l . Such studies tend to 
visualize a corpus of unrelated documents as opposed to ordered collections of revisions which we explore. 

Document visualization has gained considerable real world and research interest due to the inherent 
complexity of text and the overwhelming extent of digital text archives such as the Internet. Collections 
of version-controlled documents, such as code repositories and Google docs, compound these challenges 
by storing documents as they evolve over time and by several authors. Techniques for visualizing version- 
controlled documents tend to focus more on temporal and collaborative aspects and less on content. 

A partial list of references for text visualization are j2Tl [T3l [T4l [TUl l26l 13 with additional references 
available in [23]. A selection of software systems for visualizing text corpora includes IN-SPIRE, Jigsaw,
the Enron corpus viewer, Thomson's RefViz, and the Science topic browser.



2.1 Visualizing Word Histograms 

Visualizing numeric data, such as word histograms, serves a foundational role in visualizing complicated 
textual objects. Monographs describing traditional visualization techniques are [3] [25], while less traditional
approaches for visual data exploration are surveyed in |@). Some interesting ideas concerning visualizing 
low-dimensional numeric time series are ll28l IT3TI . Recent trends in the area of time series visualization 
are mostly concerned with interactive visualization and with multiple or vector-valued time series. An 
interesting exposition of the state-of-the-art and future vision in the related field of visual analytics is [23]. 

The use of n-grams to convert categorical sequences to numeric vectors is used extensively in the fields 
of information retrieval, speech recognition, and natural language processing. Recent monographs describ- 
ing the use of n-grams in these areas are |[T6l[T8l [TTl. Visualizing n-grams is usually accomplished through 
statistical dimensionality reduction techniques. Methods such as principal component analysis and multidi- 
mensional scaling are surveyed in [9] while iMTl reviews non-linear techniques for dimensionality reduction. 



2.2 Visualizing Version-controlled Documents 

Visualizations for version-controlled documents primarily focus on programming (code) repositories rather 
than more traditional documents. Although traditional document authoring is fundamentally different in
style and content, one could imagine using these techniques for depicting the life-cycle of general
documents.

History Flow and Text-animated-transition [27] [8] are outstanding techniques for visualizing the
authoring process in a single document. This is also the approach we take in our paper. Other approaches attempt
to visualize an entire repository containing a large number of documents. The resulting visualization focuses 
on which files are edited by whom and when rather than how the text content changes. 

Good overviews of techniques for visualizing the software evolution process are 621 HI . Specific ex- 
amples include SeeSoft Q, a line-by-line visualization of source code, as well as Augur ifTTIl and Advizor 
[6]. The latter two are collections of visualizations, such as 2D and 2.5D matrix views, which identify file
and source changes in terms of project branch, date, author, etc. Cenqua FishEye is one such commercial
tool for visually interacting with software repositories; however, its interface consists of text-centric
and graphical displays showing line charts and histograms. The StarGate project [19] and CodeSaw serve
roles similar to FishEye, i.e., tracking where and to what extent authors are concentrating their efforts,
but provide a less static presentation.

An increasing number of recent visualization techniques emphasize aesthetics and result in more
qualitative rather than quantitative presentations of information. Organic visualizations [12] use non-standard
visual mechanisms, such as swirling clouds and blooming flowers, in which data members interact to exhibit
emergent structure. Additional examples include gource and code_swarm [20].

Most of the tools above are primarily intended to provide an understanding of the evolution of a
collection of documents. Conversely, we present techniques for visually exploring the life-cycle of a single
document. The two serve fundamentally different roles and answer different questions. The former
presents a high-level overview of the evolution of a document repository, while the latter provides
a more detailed view of the authoring process of a specific document.

3 Cumulative Revision Map 

In this section, we introduce the Cumulative Revision Map and the precise techniques used to generate the 
visualization. We provide additional arguments for our design choices by contrasting with the Unix tool 
"diff" and History Flow, the visualization techniques most similar to this work. 

3.1 Data Reduction Principles 

"Diff" has been a mainstay of the Unix system since its inception in the 70s and arguably remains the most 
useful general-purpose revision visualization system. Diff takes two files as input, a reference document
and a proposal document, and outputs a sequence of line-by-line edits, i.e., adds and deletes. Such edits,
if applied to the reference file, would exactly yield the proposal file. Since there are an infinite number
of possible edit sequences, diff solves the longest common subsequence (LCS) problem to characterize a
sense of minimal change needed to realize the proposal file.
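
To make the LCS-based edit script concrete, the following is a minimal Python sketch of a line-level
diff. It is not the actual Unix implementation: the function names `lcs_table` and `diff` and the
`(operation, index, line)` edit format are assumptions introduced for illustration, and real diff tools
use more efficient algorithms than this quadratic table.

```python
from typing import List, Sequence, Tuple

Edit = Tuple[str, int, str]  # (operation, index, line); hypothetical format


def lcs_table(a: Sequence[str], b: Sequence[str]) -> List[List[int]]:
    """table[i][j] = length of the longest common subsequence of a[i:] and b[j:]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def diff(reference: Sequence[str], proposal: Sequence[str]) -> List[Edit]:
    """Return add/delete edits that turn `reference` into `proposal`.

    Delete indices refer to `reference`; add indices refer to `proposal`.
    """
    table = lcs_table(reference, proposal)
    edits: List[Edit] = []
    i = j = 0
    while i < len(reference) and j < len(proposal):
        if reference[i] == proposal[j]:
            i, j = i + 1, j + 1  # common line, no edit needed
        elif table[i + 1][j] >= table[i][j + 1]:
            edits.append(("delete", i, reference[i]))
            i += 1
        else:
            edits.append(("add", j, proposal[j]))
            j += 1
    # Whatever remains on either side is a pure deletion or addition.
    edits.extend(("delete", k, reference[k]) for k in range(i, len(reference)))
    edits.extend(("add", k, proposal[k]) for k in range(j, len(proposal)))
    return edits


reference = ["a", "b", "c", "d"]
proposal = ["a", "c", "d", "e"]
edits = diff(reference, proposal)
print(edits)  # [('delete', 1, 'b'), ('add', 3, 'e')]
```

The number of edits equals the combined length of both files minus twice the LCS length, which is the
sense in which the edit script is minimal.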

The defining characteristic of diff is its conveyance of only the differences between two documents.
We abstractly refer to this difference as the document-document delta. Presenting only the delta reduces,
often significantly, the amount of data the user must interpret, and allows him or her to form a mental
picture of the change between any two revisions and how this change is correlated with other events, such
as authorship, content, or time. Such an approach has the advantage of being lossless, given the reference
document.

In principle, one could examine multiple revisions through repeated application of diff. Unfortunately, 
such an approach is ill-suited for many modern collaborative settings where there may be hundreds or 
thousands of revisions over the course of months, years, or even decades. In these cases, the complexity of 
data is characterized not just by the delta, but by the number of such deltas. A sequence of deltas rapidly 
becomes overly complicated and belies the data-reduction principles motivating the use of diff in the first 
place. As a consequence, the user's ability to make high-level characterizations of the document's evolution
is compromised, even though this task is vitally important in settings where there may be tens or hundreds
of authors (e.g., Wikipedia).

The CRM draws upon the same data reduction principles as diff, but extends this delta-type reduction
to the time domain. As with diff, CRM depicts only the changes (between documents), but unlike diff,
the characterization is graphical rather than textual. Through this presentation, the CRM is
capable of representing hundreds or thousands of revisions while simultaneously representing the changes
to the document in their entirety.

We make these notions precise by characterizing a revision delta as a 4-tuple, denoted
$\Delta = (\mathcal{E}, \mathcal{P}, \mathcal{X}, \mathcal{Y})$. Here
$\mathcal{E} = \{\text{delete}, \text{add}, \text{delete} \bullet \text{add}\}$ denotes the possible edit
operations, $\mathcal{P} = \{\text{"string"}\}$ denotes the edit payload (possibly empty), $\mathcal{X}$
the position in the document, and $\mathcal{Y}$ the revision number. Denoting an instance of $\Delta$ as
$\delta$, one can encode the life-cycle of any document as a sequence of delta instances,
$(\delta_i)_{i \in I}$. For example, adding a LaTeX section header after the 271-th token of revision 8
is denoted as

(add, "\section{Introduction}", 271, 8).

More generally, a delta could also contain meta-information such as author or IP; however, we omit this
for simplicity.
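
As an illustration, the 4-tuple can be written down as a small Python record. The names `EditOp` and
`RevisionDelta` are hypothetical, and reading the third edit operation as a deletion combined with an
addition at the same position is an assumption rather than something the definition above spells out.

```python
from dataclasses import dataclass
from enum import Enum


class EditOp(Enum):
    DELETE = "delete"
    ADD = "add"
    DELETE_ADD = "delete-add"  # assumed meaning: delete, then add at the same position


@dataclass(frozen=True)
class RevisionDelta:
    op: EditOp        # edit operation
    payload: str      # edit payload, possibly empty
    position: int     # position in the document (here: a token index)
    revision: int     # revision number


# The example from the text: a section header added after token 271 of revision 8.
header = RevisionDelta(EditOp.ADD, r"\section{Introduction}", 271, 8)

# A document life-cycle is then an ordered sequence of such deltas.
history = [header]
```

Author or IP meta-information, omitted in the text, could be added as further optional fields.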



3.2 Schema 



As motivated by Sec. 3.1, the life-cycle of a document can be efficiently encoded as a sequence of revision
deltas $(\delta_i)_{i \in I}$, each an instantiation of the 4-tuple
$\Delta = (\mathcal{E}, \mathcal{P}, \mathcal{X}, \mathcal{Y})$ with members corresponding to an edit
operation, payload, position, and time (respectively).

The cumulative revision map visualizes this delta sequence as color-coded elements of a matrix indexed 
by both position in document and revision number. We refer to these indices as space-time coordinates 
where spatial position refers to a particular position in the document (e.g., word count, line number, or byte) 
and temporal position refers to a particular revision number. For example, Figure [7] (bottom) depicts the 
revision history of an actual conference paper. The CRM x-axis represents space with the left-most column 
representing the document beginning and the right-most the document end. Time is characterized by the 
y-axis with each row indicating a subsequent revision; the top-most row corresponds to the first revision and 
the bottom-most row the most recent revision. 

The basic edit operations are coded by color; adds are gray, deletes are red. Since the CRM graphically
depicts deltas by color-coded space-time coordinates, only the payload information is not graphically
depicted. This information is conveyed interactively via a simple pop-up mechanism (Fig. [3]) attached to
each matrix element. The pastel, horizontal bands indicate which author made the changes and vertical
bands correspond to sections (when such a construct exists in the document). Tracing the faint vertical lines
throughout shows the portions of the document which persist to the most recent revision.

The CRM can be interpreted in several different ways. Much like its diff analog, a particular revision
can be recovered by applying the edit operations (top-down) until the revision of interest is reached. More
generally, the visualization scheme naturally depicts a high-level characterization of how often the document
was edited and with what locality. Through the pop-up mechanism, users are able to obtain precise edit
details (e.g., the actual diff output) by simply clicking on the element in question.
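
Recovering a revision by applying edit operations in order can be sketched as follows. This is a
simplified illustration, not the CRM implementation: the tuple layout, the function name `recover`, and
the convention that positions are 0-based token offsets into the document as it stands when the delta is
applied are all assumptions.

```python
from typing import List, Sequence, Tuple

# (operation, payload tokens, position, revision); a simplified, hypothetical encoding
Delta = Tuple[str, Tuple[str, ...], int, int]


def recover(deltas: Sequence[Delta], revision: int) -> List[str]:
    """Rebuild the document at `revision` by replaying deltas in order."""
    doc: List[str] = []
    for op, payload, position, rev in deltas:
        if rev > revision:
            break  # deltas are assumed to be sorted by revision number
        if op == "add":
            doc[position:position] = payload
        elif op == "delete":
            if tuple(doc[position:position + len(payload)]) != payload:
                raise ValueError("delete payload does not match the document")
            del doc[position:position + len(payload)]
        else:
            raise ValueError(f"unknown operation: {op}")
    return doc


# A toy history: create a five-token document, insert a token, delete a token.
deltas: List[Delta] = [
    ("add", ("1", "2", "3", "4", "5"), 0, 1),
    ("add", ("6",), 3, 2),
    ("delete", ("1",), 0, 3),
]

print(recover(deltas, 2))  # ['1', '2', '3', '6', '4', '5']
print(recover(deltas, 3))  # ['2', '3', '6', '4', '5']
```

Storing the deleted tokens in the payload lets the replay verify each deletion, which mirrors the
lossless character of a diff given its reference document.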

3.3 Design 

CRM maintains the entire addition and deletion history of a document. The history is maintained in a graph
structure in which each node contains a subsequence of the document. At each revision, CRM solves a longest
common subsequence problem (like the Unix diff) between the sequence at that revision and the previous one.
LCS finds the minimal additions and deletions (the revision delta) between the two revisions, which
constitute the edit history at that revision. Using these revision deltas, CRM updates its graph
cumulatively. Each update consists of the following operations: leaving unchanged parts intact, splitting
the relevant node to add new content, and splitting a node to delete some of its parts. The full algorithm
is described in Algorithm 1.

Algorithm 1 The process of Cumulative Revision Map

    G = empty graph
    for each revision do
        solve LCS problem to find additions and deletions
        for each edit operation do
            find relevant node in G
            if addition then
                if add at the beginning then
                    make a new node with new content and make a link
                else if add at the ending then
                    make a new node with new content and make a link
                else
                    split the node in the position of addition
                    make a new node with new content
                    make links to split nodes
                end if
            else if deletion then
                split the node to separate the deleted part
                mark the node dead
                attach the node to the other node
            end if
        end for
    end for
    layout the graph G
    draw cumulative-change bars



An illustrative example of the CRM process is shown in Figure [1]. The integers in a node are the
contents of the node, which is a subsequence of the document. Gray boxes indicate persisting parts and red
boxes indicate non-persisting parts. The arrows connecting gray nodes indicate the sequential flow of the
latest revision in the graph.

Figure 1: Cumulative Revision Map process. (rev1) Started with a document (1,2,3,4,5). (rev2) The node
was split to insert new content. (rev3) Node '123' was split to mark the deleted node. (rev4) Node '45'
was split and new node '7' was inserted.

At revision 2, 6 was added. CRM finds the relevant node, the one with {1, 2, 3, 4, 5}, and splits it into
{1, 2, 3} and {4, 5} to insert a new node {6} in the right position. The updated graph maintains the
document at revision 2. We can confirm the contents of the document by following the gray nodes connected
with arrow edges. Revision 3 has a delete operation. Node {1, 2, 3} is split to mark the deleted
subsequence {1}. Revision 4 has one addition and one deletion. Both operations are done separately without
interfering with each other. As a result, the final CRM maintains its contents as well as its entire edit
history.

The vertical position of a node in CRM follows the revision in which its contents were added. The
horizontal position of a node follows its relative position in the latest revision. As a result, the
top-down projection of persistent nodes yields the document at the latest revision (Figure [2]).

Figure 2: The top-down projection of persistent nodes of revision 4 of the graph of Figure [1]. It has
exactly the same contents as the document at the latest revision.
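
The node splitting in Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under
stated assumptions, not the authors' implementation: it keeps the nodes in an ordered list instead of an
explicit graph with links, the names `Node`, `RevisionMap`, `add`, `delete`, and `latest` are
hypothetical, and positions are taken to be 0-based token offsets into the current (live) document. The
example replays the edits named in the Figure 1 caption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    tokens: List[str]
    born: int                    # revision in which the content was added
    died: Optional[int] = None   # revision in which it was deleted, if any

    @property
    def alive(self) -> bool:
        return self.died is None


class RevisionMap:
    """Nodes kept in document order; dead nodes stay in place as history."""

    def __init__(self) -> None:
        self.nodes: List[Node] = []

    def _split_at(self, position: int) -> int:
        """Ensure a node boundary at live-token offset `position`.

        Returns the index in `self.nodes` where content at that offset starts.
        """
        offset = 0
        for index, node in enumerate(self.nodes):
            if not node.alive:
                continue
            if position == offset:
                return index
            if position < offset + len(node.tokens):
                cut = position - offset
                tail = Node(node.tokens[cut:], node.born)
                node.tokens = node.tokens[:cut]
                self.nodes.insert(index + 1, tail)
                return index + 1
            offset += len(node.tokens)
        if position == offset:
            return len(self.nodes)
        raise IndexError("position is beyond the end of the document")

    def add(self, position: int, tokens: List[str], revision: int) -> None:
        index = self._split_at(position)
        self.nodes.insert(index, Node(list(tokens), revision))

    def delete(self, position: int, length: int, revision: int) -> None:
        start = self._split_at(position)
        end = self._split_at(position + length)
        for node in self.nodes[start:end]:
            if node.alive:
                node.died = revision  # mark dead, keep it as history

    def latest(self) -> List[str]:
        """Projection of the live (gray) nodes: the latest revision."""
        return [t for node in self.nodes if node.alive for t in node.tokens]


crm = RevisionMap()
crm.add(0, ["1", "2", "3", "4", "5"], revision=1)  # rev1: initial document
crm.add(3, ["6"], revision=2)                      # rev2: split into 123 | 45, insert 6
crm.delete(0, 1, revision=3)                       # rev3: split 123, mark 1 as dead
crm.add(4, ["7"], revision=4)                      # rev4: split 45, insert 7
print(crm.latest())  # ['2', '3', '6', '4', '7', '5']
```

Each node's `born` value would give its row in the map, and its place in the list gives its horizontal
order, so dead nodes remain visible as history while `latest()` returns only the persisting content.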



3.4 Scalability and Interactivity

CRM is scalable and interactive. The nodes and edges can be simplified to give a clear representation of
large datasets. Edges can be changed to vertical lines while shrinking all horizontal gaps between nodes,
as shown in the top section of Figure [5]. This simplification also gives a more concise document location
along the horizontal layout. Moreover, a user can intuitively point at a node with the mouse pointer to
find out what was written and when the change was made (Figure [3]).

3.5 Comparison to Related Work

CRM is distinct from other revision visualization systems in its visual characterization of a revised
document. By exploiting the sparsity present in a delta encoding of the document lifecycle, the graphical
depiction is naturally sparse yet retains the relevant information. This sparsity also facilitates
scalability, i.e., the entire document lifecycle can be visualized, a characteristic not present in many
related works.

One preexisting work similar to the CRM is History Flow (HF) [27]. Like the CRM, HF represents 
a document's spatial characteristics (e.g., tokens) and temporal characteristics (e.g., revisions). Although 
both CRM and History Flow depict revision data, the two approaches differ fundamentally. CRM implicitly 
represents a document through visualizing only the deltas, while History Flow represents the document in 
its entirety at each revision. Loosely speaking, one can think of the columns of HF as a snapshot of the 
document at a revision while the CRM can be regarded as a difference between snapshots. 

Indeed, the CRM can be transformed into HF by accumulating the rows (and transposing the result).
However, HF cannot quite be transformed into CRM. The differences between HF columns would bear
some resemblance to the CRM; however, the natural sparsity in the CRM encoding allows the visualization of
secondary information, e.g., edit persistence, section, etc., without overwhelming the visualization scheme.
To encode this data, the HF scheme relies on different visualization modalities through a user interface.

Maintaining changes has several advantages over maintaining all snapshots. First, it is simpler. Without
redundancy, the amount of information for a version-controlled document is smaller than its snapshot-based
counterpart. Such space-efficiency is critical for version-controlled documents with long histories. Second,
the CRM exploits whitespace in a manner that conveys information and yields a representation
which is easier to conceptualize.

From a purely visual perspective, the fundamental differences between the two approaches are depicted
in Figure [4]. Here we represent the two atomic edit operations: add and delete. Note that the CRM scheme
represents add persistence through a graph-like structure while HF visualizes the document column-wise.
Perhaps more noteworthy is the distinction between operations. CRM atoms have a distinct structure while
HF atoms are essentially reflections and/or rotations. The obvious structural difference in CRM atoms makes
for quicker, more intuitive interpretation.

Figure 4: Atomic edit operations on CRM (left two columns) and on History Flow (right two columns). Every
atomic operation on CRM is unique while History Flow uses horizontal reflections. Panels, CRM then
History Flow: (a, c) add at the beginning of the contents; (b, d) delete a part at the beginning of the
contents; (e, g) add at the end of the contents; (f, h) delete a part at the end of the contents;
(i, k) add at the middle of the contents; (j, l) delete a part at the middle of the contents.

Figure 5: Visualizations of a synthetic document from the History Flow website, using CRM (left) and
History Flow (right). The document has three authors and four revisions. In the CRM (left), gray boxes,
linked by vertical lines, indicate persistent content and red boxes indicate non-persistent content. The
horizontal bands of background color indicate the author. The top horizontal bar with a spectrum shows the
degree of cumulative change at each document location: brighter colors mean more change. The vertical
segmented bar on the right side of the CRM shows the cumulative change in each revision: brighter colors
mean more change. The color bands of History Flow (right) show the author of the relevant content.
Horizontal position shows the revision and vertical length shows the length of the document at that
revision.

4 Evaluation 

We demonstrate our visualization technique using several case studies. These studies include a small
synthetic document, a scientific paper written in LaTeX, computer code, and Wikipedia articles. In each of
these cases, the revision history was obtained from collaborative tools like Subversion and Wikipedia. We
focus on the following performance criteria: (a) how easy it is to determine the revision in which
some change occurs, (b) how easy it is to track what part of the document is changed, (c) which part of the
document is frequently edited, (d) what content was changed, and (e) what the style or pattern
of the authoring process is.

4.1 Comparison between CRM and History Flow 

In this section, we outline some key differences between CRM and History Flow by visualizing a synthetic 
document and a Wikipedia article. 

4.1.1 Synthetic Document 

Figure [5] shows a visualization of a short synthetic document using both CRM and the most closely related
previous work (History Flow). The synthetic document is the same one that appears on the History Flow
website (http://www.research.ibm.com/visual/projects/history_flow/explanation.htm).

The synthetic document has four revisions and three authors, and the edit history is the following. User
B adds content towards the end of the document in the second revision. User C deletes some of the content
from the second revision and adds shorter content instead. Finally, user B adds some content at around the
25% document position.

CRM (Figure [5], left) nicely describes the editing patterns. The background colors help to understand
which author is active in which revision. The gray boxes indicate persistent content and the red boxes
indicate non-persistent content (removed in a later revision). Following the gray boxes left to right
according to the edges in the figure, we can traverse the final revision of the document and understand
which part of that final document was authored by which author and at what time. Specifically, the
bottom-left box indicates that there was an addition at revision 4 (bottom row) by user B around the 25%
document position. The boxes at revision 2 (second row, right side) show that user B added some content at
the end. Seeing which content persists and which does not (red boxes in the first row) is very easy.
Finding the document positions containing the heaviest editing across revisions is easy since the
horizontal coordinate indicates the document position and is directly comparable across multiple rows.
Moreover, the row on top of the CRM indicates the document positions most heavily edited.

In contrast, the History Flow visualization (Figure [5], right) shows more accurately the relative change
of document portions but makes it hard to keep track of document positions across multiple revisions. The
horizontal dimension corresponds to revisions and the vertical dimension corresponds to document position
(color shows author). The expansion and shrinking of the vertical axis shows the changing length of the
document. This may cause confusion as it is hard to compare authoring patterns at different document
positions (the horizontal positions along the different rows are not comparable). This difficulty of
comparing revisions at specific document positions would substantially increase as we have more revisions
and more authors. In particular, the History Flow visualization of the complete revision history for long
documents would exhibit drastic stretching and shrinking, which makes it easy to analyze the shift between
one revision and the next but makes it hard to detect more global patterns.

4.1.2 Wikipedia Article 

We turn now to visualizing the authoring process of the Wikipedia article Information_Visualization (146
revisions). This document shows substantial revision activity. For example, some authors wrote content 
at the beginning that was deleted about halfway through the revision history. Then one user wrote a long 
version of the article which was substantially trimmed around 80% of the revision history, after which more 
content was added to create the present document (March 2011). 

Figure [6] shows the CRM (top) and History Flow visualization of this article. In the CRM case, the upper
half of the CRM is almost all red, indicating that nearly all content added in the early revisions was
later removed. Following left to right along the white edges connecting the gray boxes, we can easily track
the revision in which different parts of the final version were authored. For example, the very first part
of the present document comes from about 15 revisions earlier. Almost all gray boxes (persistent changes)
are located in the latter half of the revision history. The backgrounds in the latter half are colored pink
and light blue, implying that the present document was authored mostly by two authors. The bar on top of
the CRM shows high editing activity (yellow color) in the history section and the middle part of the
overview section.

The History Flow visualization (bottom figure) also shows that there was a major change in the middle
of the revision history and a transition from one group of authors in early revisions to another group
in later revisions (indicated by colors changing as we traverse the figure left to right). However, due to
extreme shrinking and expanding it is virtually impossible to determine which part was edited in different
revisions and how the vertical dimensions relate to each other. As can be seen, this is especially
difficult for long documents with many active revisions. As a result, it is impossible to determine which
parts of the document were most heavily edited throughout the revision history.

Figure 6: Visualizations of the Wikipedia article Information_Visualization, using CRM (top) and History
Flow (bottom). The document has 146 revisions with substantial activity. Some authors wrote content at
the beginning that was deleted about halfway through the revision history. Then one user wrote a long
version of the article which was substantially trimmed around 80% of the revision history, after which more
content was added to create the present document. Details are in Section 4.1.2.

4.2 CRM Case Studies



We demonstrate the CRM visualization on three real world case studies. The case studies show how the 
CRM may be used to reveal low level and high level collaborative authoring patterns. 

LaTeX Conference Paper 

Figure [7] (bottom) shows the CRM of a conference paper written in LaTeX by two of the authors of this
paper. The revision history was obtained from a Subversion repository. We make the following observations.

1. There is a striking diagonal editing pattern. This indicates a sequential editing style. In other words,
the authors worked their way from the beginning of the paper in early revisions, through the middle of the
paper in middle revisions, to the end of the paper in the final revisions.

2. Author 1 (green horizontal bands) authored a relatively small part of the document (around the middle)
and is often overruled by author 2 (purple horizontal bands). This is indicated by the red color, which
corresponds to edits that are later removed or replaced with other content. This authoring pattern is in
agreement with the fact that author 2 is the advisor of author 1 and exhibited a hands-on authoring of
the paper.

Journal Paper

Figure [7] (top) shows the CRM of a journal paper written in LaTeX by two of the authors of this paper. The
revision history was obtained from a Subversion repository. We make the following observations.

1. The middle part (method and experiment sections) had the most editing and re-editing by far. On
the other hand, the introduction, method, related work, and discussion were not revised much after
their initial authoring.

2. Both authors contribute significantly to the authoring process. Interestingly, the authors do not work
much in parallel. Author 1 (green) starts for 10 revisions, author 2 continues for 40 or so revisions,
and author 1 resumes the authoring process (with a few exceptions) until the end, when both authors
make a final pass. The authoring process is very different from the striking diagonal pattern of the
conference paper (Figure [7], bottom).

Computer Code 

Figure [8] (bottom) shows the CRM of Java computer code from a Google open source project (GWT). We 
make the following observations. 

1. The code is being repeatedly overhauled in a significant way. The many red rectangles represent non- 
persistent changes (edits that do not remain all the way to the final version). Indeed, it seems that the 
entire first 28 revisions were completely rewritten in the next 20 revisions. 

2. The lack of activity at the beginning of the document (the left part does not contain gray or red
rectangles) corresponds to documentation that is left unchanged. (It is actually a copyright description.)

3. The computer code shows a large deviation from the authoring patterns described in the previous two
cases. The computer code was edited in parallel by a large number of authors, each working on a separate
part of the code.

Presentation Slides

Figure [8] (top) shows the CRM of LaTeX code corresponding to presentation slides. We make the following 
observations. 

1. The presentation was authored by a single author who makes four general passes over the slides. Each 
pass corresponds to a horizontal or nearly horizontal sequence of edits. 

2. Each of the passes shows a sequential editing pattern (from start to end) indicated by downward
diagonal editing patterns.

3. The first pass contained relatively light editing (very rough draft) while the other three passes
contained more edits. The last pass added substantial content to the end of the presentation.

5 Summary and Discussion 

The past 10 years have seen a substantial increase in the availability and popularity of collaborative
authoring tools such as Subversion/GIT, MS Office 2010, Google Docs, and Internet wikis and discussion
boards. As a result, collaborative authoring of documents is becoming more popular and is expected to
become even more so in the near future. As the numbers of authors and revisions increase, so does the
difficulty of understanding the authoring process, both during the authoring stages and in retrospect.
Questions such as "Which author wrote which part?" and "Who removed and edited my contribution?" are
becoming increasingly hard to answer.

Thus far most visualization techniques have focused either on visualizing a large corpus of documents 
or the sequential trends within a single document. An important exception is the History Flow project which 
is related to our work but focuses on visualizing relative movements of code chunks. Our CRM framework 
differs by allowing effective visualization of authoring patterns with an emphasis on maintaining a visual 
relationship between editing patterns and absolute document position. 

The CRM framework can be used to discover low-level and high-level authoring patterns. Examples
of such low-level authoring patterns are which sections are edited heavily in which revision, which authors
are more active than others, and what parts of the document are re-edited multiple times. Examples of
high-level authoring patterns are sequential vs. parallel authoring, which authors re-edit the text of
other authors, and identifying large structural changes such as section rearrangements.

Figure 7: A journal paper (top) and a conference paper (bottom) maintained in SVN. In each CRM, document
location flows from left to right, and revisions flow from top to bottom. Gray boxes represent content
in use in the latest revision, and red boxes represent deleted content. Lines connect the gray boxes along
the content of the latest revision. Vertical background bars represent sections and horizontal background
bands show authors, each with a unique color. The top horizontal bar with a spectrum shows the degree of
cumulative change at each document location: brighter colors mean more change. The vertical segmented bar
on the right side of each figure shows the cumulative change in each revision: brighter colors mean more
change. Details are in Section 4.2.

Figure 8: A proposal slide (top), and a Google Tool Kit Compiler source file (bottom). In each CRM,
document location flows from left to right, and revisions flow from top to bottom. Gray boxes represent
content in use in the latest revision, and red boxes represent deleted content. Lines connect the gray
boxes along the content of the latest revision. Vertical background bars represent sections and horizontal
background bands show authors, each with a unique color. The top horizontal bar with a spectrum shows the
degree of cumulative change at each document location: brighter colors mean more change. The vertical
segmented bar on the right side of each figure shows the cumulative change in each revision: brighter
colors mean more change. Details are in Section 4.2.